The aim of this report is to present our exploratory data analysis, visualisations and other insights into the Airbnb data. We focus on Beijing, one of the most densely populated cities in the world, and aim to perform an in-depth analysis that helps customers find value-for-money accommodation and, where possible, some special local experiences. In this report, we aim to answer the following questions:
Please also see our presentation, BUSI97273Visualisation_Group15-Presentation.
In this section, we detail our analysis of the questions of interest mentioned in the introduction and gain preliminary insights through exploratory data analysis and visualisation. We have divided it into three subsections, each answering the questions through a variety of visualisations.
library(readr)
library(dplyr)
library(ggplot2)
library(choroplethr)
library(GGally)
library(lubridate)
library(zoo)
library(scales)
library(ggmap)
library(stringr)
library(leaflet)
library(gridExtra)
library(maps)
library(maptools)
library(sp)
library(rgdal)
library(broom)
library(ggthemes)
library(mapproj)
source("https://raw.githubusercontent.com/iascchen/VisHealth/master/R/calendarHeat.R")
options(scipen=999)
library(tidyverse)
library(htmlwidgets)
library(text2vec)
library(tm)
library(wordcloud)
library(tmap)
library(readxl)
library(alluvial)
library(RColorBrewer)
library(hrbrthemes)
library(viridis)
library(plotly)
Sys.setlocale("LC_TIME","English")
Before the analysis, we first load all the CSV files downloaded from Inside Airbnb, including listings, reviews, calendar and neighbourhoods.
listings <- read_csv("listings.csv")
listings2 <- read_csv("listings-2.csv")
reviews <- read_csv("reviews.csv")
calendar <- read_csv("calendar.csv")
neighborhood <- read_csv("neighbourhoods.csv")
# Convert price from a currency string such as "$1,280.00" to a numeric value
listings$price <- as.numeric(
gsub(",", "",substring(listings$price, 2)))
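This cleaning step drops the leading currency symbol and the thousands separator before coercing to numeric; a minimal illustration on hypothetical price strings:

```r
# Hypothetical raw price strings as they appear in the listings file
raw_price <- c("$1,280.00", "$388.00", "$12,000.00")

# substring(..., 2) drops the "$", gsub() removes the ",",
# and as.numeric() parses the remainder
clean_price <- as.numeric(gsub(",", "", substring(raw_price, 2)))
clean_price  # 1280 388 12000
```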
This section explores the distribution of property locations across Beijing, average review scores within relatively small areas, and average prices in those same areas.
The map below provides a zoomable view of the locations and clusters of Airbnb accommodation in Beijing by price, room type and property type.
As the zoomed-in screenshot of the map shows, properties are fairly evenly distributed, so you can find Airbnb accommodation almost anywhere in Beijing. Although spread across the whole city, most properties lie along main roads or transport lines. Transport is vital when travelling, so wherever visitors stay in Beijing, they can expect convenient transport links nearby.
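The interactive map was built with leaflet; below is a minimal sketch of that kind of call, using a tiny hypothetical stand-in for the listings table (the column names follow the Inside Airbnb schema; the coordinates and prices are made up):

```r
library(leaflet)

# Hypothetical mini listings table with coordinates near central Beijing
demo <- data.frame(
  longitude = c(116.39, 116.41, 116.45),
  latitude  = c(39.90, 39.92, 39.88),
  room_type = c("Entire home/apt", "Private room", "Shared room"),
  price     = c(420, 180, 90)
)

m <- leaflet(demo) %>%
  addTiles() %>%                               # OpenStreetMap base layer
  addCircleMarkers(
    lng = ~longitude, lat = ~latitude,
    radius = 5, stroke = FALSE, fillOpacity = 0.7,
    clusterOptions = markerClusterOptions(),   # collapse nearby points into clusters
    popup = ~paste0(room_type, ": ", price, " RMB")
  )
m
```

The full map in the report colours points by price, room type and property type; the clustering option is what produces the zoomable cluster view described above.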
Because there is no zip-code reference, and the average score of each neighbourhood is too coarse and ambiguous, we round latitude and longitude to group listings into relatively small areas and compute the average location score for each, showing which areas are rated more highly in Beijing. The map below shows this distribution.
# load and read China's map to filter the map of Beijing
china_map_adm2 <- readOGR("./gadm36_CHN_shp/gadm36_CHN_3.shp",use_iconv = TRUE, encoding="UTF-8")
beijing_map <- subset(china_map_adm2,NAME_1=="Beijing")
beijing_map@data$id <- rownames(beijing_map@data)
beijingdata <- beijing_map@data
beijingmapdata <- broom::tidy(beijing_map)
# Which locations have better ratings?
locReviews <- listings %>%
mutate(lat = round(listings$latitude, 3),
long = round(listings$longitude, 3)) %>%
group_by(lat, long) %>%
summarise(avg_loc_review = mean(review_scores_location,
na.rm = TRUE))
colnames(locReviews) <- c("lat","long","LocationReviewScore")
fig_review <- ggplot()+
geom_polygon(data=beijingmapdata,
aes(x=long,y=lat,group=group),
fill = "white",
col="grey50", size = 0.25) +
theme_void()+
coord_map("polyconic") +
theme_map()+
geom_point(data=locReviews,
aes(x=long,y=lat, color = LocationReviewScore),
alpha=0.5, size=2)+
scale_color_gradient(low="#d3cbcb", high="#852eaa")+
theme(legend.position="right")
ggplotly(fig_review)
In fact, the ratings in Beijing do not appear to be strongly related to specific locations, because every area contains both very highly and very poorly rated properties. So it is not the location but other specific aspects of a property that really matter.
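This impression can be checked numerically: if location drove ratings, the spread of area-average scores (between areas) would dominate the typical spread of scores inside each area. A sketch of that check, run here on synthetic data purely to illustrate the method (the score range and coordinates are made up):

```r
library(dplyr)

# Split review-score variation into a between-area and a within-area part,
# grouping by latitude/longitude rounded to 3 decimals as in the maps above
score_spread <- function(df) {
  by_area <- df %>%
    mutate(lat = round(latitude, 3), long = round(longitude, 3)) %>%
    group_by(lat, long) %>%
    summarise(area_mean = mean(review_scores_location, na.rm = TRUE),
              area_sd   = sd(review_scores_location),
              .groups = "drop")
  c(between_area_sd = sd(by_area$area_mean, na.rm = TRUE),
    mean_within_sd  = mean(by_area$area_sd, na.rm = TRUE))
}

set.seed(1)
demo <- data.frame(
  latitude  = runif(500, 39.90, 39.92),
  longitude = runif(500, 116.40, 116.42),
  review_scores_location = sample(6:10, 500, replace = TRUE)
)
spread <- score_spread(demo)
spread
```

On the real listings data, a within-area spread comparable to the between-area spread would support the observation that location alone does not explain the ratings.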
Similar to the previous section, we use a rounded latitude and longitude to group and calculate the average price in a relatively small area to show which areas have higher prices in Beijing.
# Which locations have higher prices?
locPrices <- listings %>%
mutate(lat = round(listings$latitude, 3),
long = round(listings$longitude, 3)) %>%
group_by(lat, long) %>%
summarise(avg_loc_price = mean(price,
na.rm = TRUE))
colnames(locPrices) <- c("lat","long","LocationPrice")
fig_price <- ggplot()+
geom_polygon(data=beijingmapdata,
aes(x=long,y=lat,group=group),
fill = "white",
col="grey50", size = 0.25) +
theme_void()+
coord_map("polyconic") +
theme_map()+
geom_point(data=locPrices,
aes(x=long,y=lat, color = LocationPrice),
alpha=0.8, size=2)+
scale_color_gradient(low="#F1EEF6", high="#0571B0")+
theme(legend.position="right")
fig_price
At first, however, the average prices on the plot were so similar that we could not extract any useful information. Double-checking the data, we noticed several extreme values above 5,000 RMB in the price data that distorted the whole plot.
We therefore filtered out these outlier prices and re-plotted the average price on the map.
# Which locations have higher prices with filtered data?
locPrices <- listings %>%
mutate(lat = round(listings$latitude, 3),
long = round(listings$longitude, 3)) %>%
group_by(lat, long) %>%
filter(price <= 5000)%>%
summarise(avg_loc_price = mean(price,
na.rm = TRUE))
colnames(locPrices) <- c("lat","long","LocationPrice")
fig_price <- ggplot()+
geom_polygon(data=beijingmapdata,
aes(x=long,y=lat,group=group),
fill = "white",
col="grey50", size = 0.25) +
theme_void()+
coord_map("polyconic") +
theme_map()+
geom_point(data=locPrices,
aes(x=long,y=lat, color = LocationPrice),
alpha=0.8, size=2)+
scale_color_gradient(low="#F1EEF6", high="#0571B0")+
theme(legend.position="right")
ggplotly(fig_price)
We find that most of the expensive properties are located in the centre of Beijing. This is not surprising, since the city centre has a dense population, well-developed facilities and transport, better living conditions and thus higher demand. However, some expensive properties also appear in the southern and eastern suburbs of Beijing; these could be villas or resort hotels.
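One way to make the "city centre" observation concrete is to measure each area's distance from a central reference point and relate it to price. A sketch using the haversine formula in base R, taking Tiananmen Square (about 39.909°N, 116.397°E) as the reference point (the choice of reference is our assumption):

```r
# Great-circle distance in km from a reference point (haversine formula)
haversine_km <- function(lat, lon, lat0 = 39.909, lon0 = 116.397) {
  to_rad <- pi / 180
  dlat <- (lat - lat0) * to_rad
  dlon <- (lon - lon0) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat0 * to_rad) * cos(lat * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))   # Earth radius ~6371 km
}

# Distance of a point in the north-eastern suburbs from the centre
round(haversine_km(40.05, 116.55), 1)
```

Adding `dist = haversine_km(lat, long)` to the locPrices table and plotting LocationPrice against dist would quantify how quickly prices fall away from the centre.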
In this section, we analyse seasonal changes in the price of Airbnb listings in Beijing, to help customers decide when it is better and more economical to travel.
First, we use the ‘calendar’ data to examine what occupancy looks like over the next year, in order to make some predictions about price. In theory, the more bookings made in advance, the higher the demand and therefore the higher the price. To visualise percentage occupancy for the next year, we use a calendar heat map, which provides an overall view of reservation status. Surprisingly, the reservation rate is close to 0% throughout 2022. This likely results from the unstable pandemic situation, which makes people reluctant to book in advance, since a sudden change to their travel plans may mean losing the booking deposit. As a result, we cannot extract any useful information for price prediction for 2022.
# which is the best date for the travel?
source("https://raw.githubusercontent.com/iascchen/VisHealth/master/R/calendarHeat.R")
#load and read the calendar dataset to predict the price of next year
calendar <- calendar %>% mutate(booked = ifelse(available=="f", 1, 0))
groupedCalendar <- calendar %>% group_by(date = date) %>% summarise(totalBooked = sum(booked, na.rm = TRUE), totalListings = n()) %>% mutate(percent_booked = (totalBooked/totalListings)*100)
calendarHeat(groupedCalendar$date, groupedCalendar$percent_booked, ncolors = 99, color = "g2r", varname="Occupancy (Percentage) by Month for Prediction")
We therefore use data from past years to make some predictions. We select the past three years from the ‘listings’ data, using the last review date as the time point at which to observe the price.
We first explore whether prices differ between weekdays and the weekend, using a box plot with the day of the week on the x-axis and average price on the y-axis. From the box plot we can only see that the highest average price occurs on Saturday; for the remaining days, the averages differ only slightly. We therefore cannot identify the single best day for a cost-saving trip, but it is at least not Saturday.
price<-dplyr::filter(listings, !is.na(last_review)) #ignore all the NA value of last review
groupedPriceAll <- price %>% group_by(date = last_review) %>% summarise(averagePrice = mean(price, na.rm = TRUE)) %>% mutate(year = year(date), commonYear = paste("2019",substring(date, 6),sep="-"))
groupedPriceAll$year <- as.factor(as.character(groupedPriceAll$year))
groupedPriceAll$commonYear <- ymd(groupedPriceAll$commonYear)
groupedPriceAll <- groupedPriceAll%>%
mutate(day = strftime(date,'%A'))
groupedPriceAll$day <- factor(groupedPriceAll$day, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"), labels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
# Using boxplot to estimate the tendency in the week
ggplot(groupedPriceAll, aes(x = factor(day),
y = averagePrice)) +
geom_boxplot(fill = "#99BBFF", color = "black") +
geom_jitter(alpha = 0.05, width = 0.1, color = "#007A87") +
ggtitle("Which day is the most expensive during the week?",
subtitle = "Boxplots of Price by Day of the Week") +
labs(x = "Day of the week", y = "Average Price (RMB)") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
Since the weekly pattern yields little further information, we turn to monthly patterns and use a line chart, hoping to identify the best month for a cost-effective trip. However, because the average price fluctuates daily, no clear pattern emerges here either.
# which is the best month for the travel?
#Draw the line chart to show the relationship between month and average price during three years
ggplot(groupedPriceAll[year(groupedPriceAll$date) >= 2019 & year(groupedPriceAll$date) <= 2021,], aes(commonYear, averagePrice)) +
geom_line(na.rm=TRUE, alpha=0.5, color = "#99BBFF")+ facet_grid(~year)+
ggtitle("Seasonality in Price",
subtitle = "Average listing price(RMB) across Months") +
labs(x = "Month", y = "Average price across Listings") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
scale_x_date(labels = date_format("%b"))
Lastly, we use a scatter plot with a fitted curve to show the tendency across years, which may reveal useful information for price prediction.
# which is the best month for the travel?
#Draw the scatter plot with fitting curve and using the date across 2019,2020,2021 in the listings table.
ggplot(groupedPriceAll[year(groupedPriceAll$date) >= 2019 & year(groupedPriceAll$date) <= 2021,], aes(commonYear, averagePrice)) +
geom_point(na.rm=TRUE, alpha=0.5, color = "#99BBFF") +
geom_smooth(color = "#FF5A5F")+ facet_grid(~year)+
ggtitle("Seasonality in Price",
subtitle = "Average listing price(RMB) across Months") +
labs(x = "Month", y = "Average price across Listings") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) + scale_x_date(labels = date_format("%b"))
We can see that the price patterns of 2020 and 2021 differ from that of 2019. This is likely due to the epidemic and the restrictive travel policies introduced from the beginning of 2020, which dramatically reduced travel volumes and disrupted the usual price pattern.
From the 2019 panel, a normal year without the epidemic, we can see two upward trends in price. One is in June, when the summer holiday starts in China; the other is in December, close to the Chinese New Year holiday period. During these two holiday periods people take more trips, and the high demand for accommodation pushes prices up significantly against limited supply. Customers who prefer a cost-effective trip should therefore avoid these months. The lowest average price occurs around October, when students have returned to school after the summer vacation and people are back at work, so it can be a good time to take time off in lieu for a trip.
Now we move to the abnormal years from 2020 onwards. The plot shows almost no observations from February to June, because of the epidemic and the restrictive travel policies introduced by the government. After July, the average price shows a downward trend. We infer that the constantly changing epidemic situation made people postpone or cancel their travel plans; as demand dropped sharply, hosts may have had to offer discounts and cut prices to attract customers.
Lastly, in 2021 the price is high at the beginning of the year, likely because of Chinese New Year, and then shows a downward trend similar to that of 2020. The price at the end of the year is even lower than in 2020 owing to the Omicron outbreak, as flight cancellations and travel restrictions forced people to adjust their travel plans; the resulting drop in demand for accommodation led to lower prices.
After analysing prices for both normal and abnormal years, we would recommend tourists travel in the second half of the year, when prices are generally lower than in the first half. They should be aware, however, that since the start of 2022 the pandemic situation in China has become more stable and travel restrictions between regions have been lifted, so prices in the second half of the year may rise significantly, as in 2019.
To answer the question of which property type to book, we first consider how amenities vary across property types. Since there are over 70 property types in Beijing, we select six main ones for the analysis: ‘Entire villa’, ‘Farm stay’, ‘Kezhan’, ‘Minsu’, ‘Room in boutique hotel’ and ‘Treehouse’. Because amenities cannot be compared by a numeric score, we use text mining to find the most frequent words for each of these six types and check whether there are significant differences between them, which may provide guidance when customers have particular needs.
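The text-mining code below relies on a cleaning helper `prep_fun` and on per-type subsets (`hotel`, `Kezhan`, `Minsu`, `Entirevilla`, `Farmstay`, `Treehouse`) whose construction is not shown in the report. A sketch of how they could be built (the exact cleaning steps are our assumption; commas are deliberately left attached to tokens, since the later code strips them with `gsub(",", "", ...)`):

```r
library(stringr)

# Lower-case the raw amenities JSON-like string and strip brackets,
# braces and quotes; commas are left in place because the word-cloud
# code later removes them from individual tokens
prep_fun <- function(x) {
  x <- str_to_lower(x)
  x <- str_replace_all(x, "[\\[\\]\"{}]", "")
  str_squish(x)
}

cleaned <- prep_fun('["Wifi", "Hot water", "Free parking"]')
cleaned

# Hypothetical per-type subsets of the listings table:
# hotel       <- dplyr::filter(listings, property_type == "Room in boutique hotel")
# Kezhan      <- dplyr::filter(listings, property_type == "Kezhan")
# Minsu       <- dplyr::filter(listings, property_type == "Minsu")
# Entirevilla <- dplyr::filter(listings, property_type == "Entire villa")
# Farmstay    <- dplyr::filter(listings, property_type == "Farm stay")
# Treehouse   <- dplyr::filter(listings, property_type == "Treehouse")
```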
layout(t(matrix(seq(1,12), nrow=3)), heights=c(1, 4, 1, 4))
par(mar=rep(0, 4))
plot.new()
text(x=0.5, y=0.5, "Hotel", cex=2)
plot.new()
text(x=0.5, y=0.5, "Kezhan", cex=2)
plot.new()
text(x=0.5, y=0.5, "Minsu", cex=2)
# Conduct text mining using word cloud clusters with 'hotel' data
hotelamenity <- prep_fun(as.character(hotel$amenities))
tokens_hotel <- space_tokenizer(hotelamenity)
it_hotel = itoken(tokens_hotel, progressbar = FALSE)
vocab_hotel <- create_vocabulary(it_hotel)
vectorizer_hotel <- vocab_vectorizer(vocab_hotel)
# use window of 5 for context words
tcm_hotel <- create_tcm(it_hotel, vectorizer_hotel,
skip_grams_window = 5L)
glove_hotel = GlobalVectors$new(rank = 30, x_max = 10)
word_vectors_hotel = glove_hotel$fit_transform(tcm_hotel, n_iter = 10)
word_vectors1_hotel <- glove_hotel$components
most_count_hotel <- as.character(
filter(vocab_hotel, doc_count == max(vocab_hotel$doc_count))[1,1])
p1_hotel = word_vectors_hotel[most_count_hotel, , drop = FALSE]
cos_sim_hotel = sim2(x = word_vectors_hotel, y = p1_hotel, method = "cosine", norm = "l2")
p1_hotel = sort(cos_sim_hotel[,1], decreasing = TRUE)
df_hotel = data.frame(item = as.character(names(p1_hotel)),freq = as.numeric(p1_hotel))
df_hotel$item = gsub(",","",df_hotel$item)
df_hotel = df_hotel[!duplicated(df_hotel$item), ]
set.seed(1234)
suppressWarnings(
wordcloud(words = df_hotel$item, freq = df_hotel$freq,
scale = c(3,0.3),max.words=80, random.order=FALSE,
rot.per=0.2,
colors = c("#e06f69","#357b8a",
"#7db5b8", "#59c6f3")))
# Conduct text mining using word cloud clusters with 'Kezhan' data
kezhanamenity <- prep_fun(as.character(Kezhan$amenities))
tokens_kezhan <- space_tokenizer(kezhanamenity)
it_kezhan = itoken(tokens_kezhan, progressbar = FALSE)
vocab_kezhan <- create_vocabulary(it_kezhan)
vectorizer_kezhan <- vocab_vectorizer(vocab_kezhan)
# use window of 5 for context words
tcm_kezhan <- create_tcm(it_kezhan, vectorizer_kezhan,
skip_grams_window = 5L)
glove_kezhan = GlobalVectors$new(rank = 30, x_max = 10)
word_vectors_kezhan = glove_kezhan$fit_transform(tcm_kezhan, n_iter = 10)
word_vectors1_kezhan <- glove_kezhan$components
most_count_kezhan <- as.character(
filter(vocab_kezhan, doc_count == max(vocab_kezhan$doc_count))[1,1])
p1_kezhan = word_vectors_kezhan[most_count_kezhan, , drop = FALSE]
cos_sim_kezhan = sim2(x = word_vectors_kezhan, y = p1_kezhan, method = "cosine", norm = "l2")
p1_kezhan = sort(cos_sim_kezhan[,1], decreasing = TRUE)
df_kezhan = data.frame(item = as.character(names(p1_kezhan)),freq = as.numeric(p1_kezhan))
df_kezhan$item = gsub(",","",df_kezhan$item)
df_kezhan = df_kezhan[!duplicated(df_kezhan$item), ]
set.seed(1234)
suppressWarnings(
wordcloud(words = df_kezhan$item, freq = df_kezhan$freq,
scale = c(3,0.3),max.words=80, random.order=FALSE,
rot.per=0.2,
colors = c("#e06f69","#357b8a",
"#7db5b8", "#59c6f3")))
# Conduct text mining using word cloud clusters with 'Minsu' data
minsuamenity <- prep_fun(as.character(Minsu$amenities))
tokens_minsu <- space_tokenizer(minsuamenity)
it_minsu = itoken(tokens_minsu, progressbar = FALSE)
vocab_minsu <- create_vocabulary(it_minsu)
vectorizer_minsu <- vocab_vectorizer(vocab_minsu)
# use window of 5 for context words
tcm_minsu <- create_tcm(it_minsu, vectorizer_minsu,
skip_grams_window = 5L)
glove_minsu = GlobalVectors$new(rank = 30, x_max = 10)
word_vectors_minsu = glove_minsu$fit_transform(tcm_minsu, n_iter = 10)
word_vectors1_minsu <- glove_minsu$components
most_count_minsu <- as.character(
filter(vocab_minsu, doc_count == max(vocab_minsu$doc_count))[1,1])
p1_minsu = word_vectors_minsu[most_count_minsu, , drop = FALSE]
cos_sim_minsu = sim2(x = word_vectors_minsu, y = p1_minsu, method = "cosine", norm = "l2")
p1_minsu = sort(cos_sim_minsu[,1], decreasing = TRUE)
df_minsu = data.frame(item = as.character(names(p1_minsu)),freq = as.numeric(p1_minsu))
df_minsu$item = gsub(",","",df_minsu$item)
df_minsu = df_minsu[!duplicated(df_minsu$item), ]
set.seed(1234)
suppressWarnings(
wordcloud(words = df_minsu$item, freq = df_minsu$freq,
scale = c(3,0.3),max.words=80, random.order=FALSE,
rot.per=0.2,
colors = c("#e06f69","#357b8a",
"#7db5b8", "#59c6f3")))
plot.new()
text(x=0.5, y=0.5, "Entirevilla", cex=2)
plot.new()
text(x=0.5, y=0.5, "Homestay", cex=2)
plot.new()
text(x=0.5, y=0.5, "Treehouse", cex=2)
# Conduct text mining using word cloud clusters with 'Entirevilla' data
entirevillaamenity <- prep_fun(as.character(Entirevilla$amenities))
tokens_entirevilla <- space_tokenizer(entirevillaamenity)
it_entirevilla = itoken(tokens_entirevilla, progressbar = FALSE)
vocab_entirevilla <- create_vocabulary(it_entirevilla)
vectorizer_entirevilla <- vocab_vectorizer(vocab_entirevilla)
# use window of 5 for context words
tcm_entirevilla <- create_tcm(it_entirevilla, vectorizer_entirevilla,
skip_grams_window = 5L)
# A first fit attempt diverged ("Cost is too big ... try smaller learning rate"),
# so we lower the learning rate from its default to keep the optimisation stable
glove_entirevilla = GlobalVectors$new(rank = 30, x_max = 10, learning_rate = 0.05)
word_vectors_entirevilla = glove_entirevilla$fit_transform(tcm_entirevilla, n_iter = 10)
word_vectors1_entirevilla <- glove_entirevilla$components
most_count_entirevilla <- as.character(
filter(vocab_entirevilla, doc_count == max(vocab_entirevilla$doc_count))[1,1])
p1_entirevilla = word_vectors_entirevilla[most_count_entirevilla, , drop = FALSE]
cos_sim_entirevilla = sim2(x = word_vectors_entirevilla, y = p1_entirevilla, method = "cosine", norm = "l2")
p1_entirevilla = sort(cos_sim_entirevilla[,1], decreasing = TRUE)
df_entirevilla = data.frame(item = as.character(names(p1_entirevilla)),freq = as.numeric(p1_entirevilla))
df_entirevilla$item = gsub(",","",df_entirevilla$item)
df_entirevilla = df_entirevilla[!duplicated(df_entirevilla$item), ]
set.seed(1234)
suppressWarnings(
wordcloud(words = df_entirevilla$item, freq = df_entirevilla$freq,
scale = c(3,0.3),max.words=80, random.order=FALSE,
rot.per=0.2,
colors = c("#e06f69","#357b8a",
"#7db5b8", "#59c6f3")))
# Conduct text mining using word cloud clusters with 'Farmstay' data
farmstayamenity <- prep_fun(as.character(Farmstay$amenities))
tokens_farmstay <- space_tokenizer(farmstayamenity)
it_farmstay = itoken(tokens_farmstay, progressbar = FALSE)
vocab_farmstay <- create_vocabulary(it_farmstay)
vectorizer_farmstay <- vocab_vectorizer(vocab_farmstay)
# use window of 5 for context words
tcm_farmstay <- create_tcm(it_farmstay, vectorizer_farmstay,
skip_grams_window = 5L)
glove_farmstay = GlobalVectors$new(rank = 30, x_max = 10)
word_vectors_farmstay = glove_farmstay$fit_transform(tcm_farmstay, n_iter = 10)
word_vectors1_farmstay <- glove_farmstay$components
most_count_farmstay <- as.character(
filter(vocab_farmstay, doc_count == max(vocab_farmstay$doc_count))[1,1])
p1_farmstay = word_vectors_farmstay[most_count_farmstay, , drop = FALSE]
cos_sim_farmstay = sim2(x = word_vectors_farmstay, y = p1_farmstay, method = "cosine", norm = "l2")
p1_farmstay = sort(cos_sim_farmstay[,1], decreasing = TRUE)
df_farmstay = data.frame(item = as.character(names(p1_farmstay)),freq = as.numeric(p1_farmstay))
df_farmstay$item = gsub(",","",df_farmstay$item)
df_farmstay = df_farmstay[!duplicated(df_farmstay$item), ]
set.seed(1234)
suppressWarnings(
wordcloud(words = df_farmstay$item, freq = df_farmstay$freq,
scale = c(3,0.3),max.words=80, random.order=FALSE,
rot.per=0.2,
colors = c("#e06f69","#357b8a",
"#7db5b8", "#59c6f3")))
# Conduct text mining using word cloud clusters with 'Treehouse' data
treehouseamenity <- prep_fun(as.character(Treehouse$amenities))
tokens_treehouse <- space_tokenizer(treehouseamenity)
it_treehouse = itoken(tokens_treehouse, progressbar = FALSE)
vocab_treehouse <- create_vocabulary(it_treehouse)
vectorizer_treehouse <- vocab_vectorizer(vocab_treehouse)
# use window of 5 for context words
tcm_treehouse <- create_tcm(it_treehouse, vectorizer_treehouse,
skip_grams_window = 5L)
glove_treehouse = GlobalVectors$new(rank = 30, x_max = 10)
word_vectors_treehouse = glove_treehouse$fit_transform(tcm_treehouse, n_iter = 10)
word_vectors1_treehouse <- glove_treehouse$components
most_count_treehouse <- as.character(
filter(vocab_treehouse, doc_count == max(vocab_treehouse$doc_count))[1,1])
p1_treehouse = word_vectors_treehouse[most_count_treehouse, , drop = FALSE]
cos_sim_treehouse = sim2(x = word_vectors_treehouse, y = p1_treehouse, method = "cosine", norm = "l2")
p1_treehouse = sort(cos_sim_treehouse[,1], decreasing = TRUE)
df_treehouse = data.frame(item = as.character(names(p1_treehouse)),freq = as.numeric(p1_treehouse))
df_treehouse$item = gsub(",","",df_treehouse$item)
df_treehouse = df_treehouse[!duplicated(df_treehouse$item), ]
set.seed(1234)
suppressWarnings(
wordcloud(words = df_treehouse$item, freq = df_treehouse$freq,
scale = c(3,0.3),max.words=80, random.order=FALSE,
rot.per=0.2,
colors = c("#e06f69","#357b8a",
"#7db5b8", "#59c6f3")))
From the word clouds of amenities, we find little difference in key amenities across the six main property types, even though they advertise different living experiences. The most frequent words are ‘parking’, ‘hot water’ and ‘wifi’, which are among the most basic requirements of customers. So the visualisation of key amenities does not provide a clear contrast between property types. Perhaps, as modern accommodation services have developed, hosts of all types have converged on covering customers’ basic requirements before differentiating on personal needs. As such, the textual visualisations cannot show obvious differences across the six property types and offer little guidance to customers making a choice.
Given the similarity in amenities, we focus on the two most basic factors that customers value: price and review score. We first use a scatter plot with price on the x-axis and review score on the y-axis to look for differences across the property types. For Minsu and Treehouse, many observations are removed due to missing values, so they provide little insight; and the point distributions of the other four types are very similar, so no significant differences emerge.
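The scatter plot just described can be sketched as follows; the data frame here is a synthetic stand-in for the six-type subset of `listings` (column names follow the Inside Airbnb schema, values are made up):

```r
library(ggplot2)

types <- c("Entire villa", "Farm stay", "Kezhan", "Minsu",
           "Room in boutique hotel", "Treehouse")

set.seed(15)
# Synthetic prices and review scores for 20 listings of each type
demo <- data.frame(
  property_type = rep(types, each = 20),
  price = round(runif(120, 100, 2000)),
  review_scores_rating = round(runif(120, 3.5, 5), 1)
)

p_scatter <- ggplot(demo, aes(price, review_scores_rating)) +
  geom_point(alpha = 0.5, color = "#357b8a", na.rm = TRUE) +
  facet_wrap(~property_type) +
  labs(x = "Price (RMB)", y = "Review score",
       title = "Price vs review score by property type")
p_scatter
```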
We then compute the average price and average review score for each property type and use a bar chart combined with a line to show the differences. Compared with the scatter plot, the bar chart is much clearer. Kezhan and Room in boutique hotel have the lowest average prices of the six property types, while Treehouse has the highest. The average review score barely varies across types, with only Minsu slightly higher. For customers seeking value-for-money accommodation, we would therefore recommend booking a Kezhan rather than a common hotel, as Kezhan (small-scale Chinese-style homestays) are steeped in Beijing's local customs and should provide customers with an unforgettable experience.
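The bar-chart-with-line figure can be sketched like this; the per-type averages below are illustrative placeholders, not the values computed from the data:

```r
library(ggplot2)

avg <- data.frame(
  property_type = c("Entire villa", "Farm stay", "Kezhan", "Minsu",
                    "Room in boutique hotel", "Treehouse"),
  avg_price = c(1500, 800, 350, 600, 400, 2200),   # illustrative RMB values
  avg_score = c(4.6, 4.7, 4.6, 4.8, 4.6, 4.7)      # illustrative 0-5 scores
)

# Map the 0-5 score range onto the price axis so both series share one panel
scale_f <- max(avg$avg_price) / 5

p_bar <- ggplot(avg, aes(x = property_type)) +
  geom_col(aes(y = avg_price), fill = "#99BBFF") +
  geom_line(aes(y = avg_score * scale_f, group = 1), color = "#FF5A5F") +
  geom_point(aes(y = avg_score * scale_f), color = "#FF5A5F") +
  scale_y_continuous(name = "Average price (RMB)",
                     sec.axis = sec_axis(~ . / scale_f,
                                         name = "Average review score")) +
  labs(x = "Property type",
       title = "Average price and review score by property type") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
p_bar
```

The secondary axis is a pure rescaling, so the line is readable against the bars without a second panel.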
In terms of location, most properties lie along main roads or transport lines, making it convenient to travel around, so location is not a decisive factor when choosing where to stay in Beijing. We recommend tourists pay more attention to the accommodation itself, in terms of service, environment and so on, to find the place that best meets their personal needs.
From the analysis of seasonal price changes over 2019 to 2021, we recommend tourists travel in the second half of the year, when prices are generally lower. But with the pandemic situation in China stabilising and many travel restrictions lifted since the start of 2022, prices in the second half of the year may rise significantly, as in the normal year 2019. Tourists who care about value for money should therefore plan ahead in case of significant price increases.
Regarding which property type to book, text mining shows almost no significant differences in key amenities among the six selected property types. We therefore focus on the two basic factors customers value most: price and review score. The scatter plot of the raw data yields little insight because the point distributions are similar, so we finally compute the averages of both factors. The bar chart with the review-score line shows that Kezhan and Room in boutique hotel, with reasonable review scores, have comparably low prices. So for customers who want value-for-money accommodation and a taste of local customs, we would recommend booking a Kezhan instead of a common hotel, since Kezhan, small-scale Chinese-style homestays, will give tourists an unforgettable cultural journey.